{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "ef2ed242-3561-464c-8d1c-cc3862e23702",
   "metadata": {},
   "source": [
    "# Text Generation via Speculative Decoding using FastDraft and OpenVINO™\n",
    "\n",
    "As model sizes grow, Generative AI implementations require significant inference resources. This not only increases the cost per generation from a prompt, but also increases the power consumption used to serve such requests.\n",
    "\n",
    "Inference optimizations for text generation are essential for reducing costs and power consumption. When optimizing the inference process, the amount of time and energy required to generate text can be significantly reduced. This can lead to cost savings in terms of hardware and software, as well as reduced power consumption. Additionally, inference optimizations can help improve the accuracy of text generation as well as the speed at which it can be generated. This can lead to an improved user experience and increased efficiency in text-generation tasks. In summary, inference optimizations for text generation are essential to reduce costs and power consumption, while also improving the accuracy and speed of text generation.\n",
    "\n",
    "\n",
    "Speculative decoding (or [assisted-generation](https://huggingface.co/blog/assisted-generation#understanding-text-generation-latency)) is a recent technique, that allows to speed up token generation when an additional smaller draft model is used alongside with the main model.\n",
    "\n",
    "Speculative decoding works the following way. The draft model predicts the next K tokens one by one in an autoregressive manner, while the main model validates these predictions and corrects them if necessary. We go through each predicted token, and if a difference is detected between the draft and main model, we stop and keep the last token predicted by the main model. Then the draft model gets the latest main prediction and again tries to predict the next K tokens, repeating the cycle.\n",
    "\n",
    "This approach reduces the need for multiple infer requests to the main model, enhancing performance. For instance, in more predictable parts of text generation, the draft model can, in best-case scenarios, generate the next K tokens that exactly match the target. In that case they are validated in a single inference request to the main model (which is bigger, more accurate but slower) instead of running K subsequent requests. More details can be found in the original [paper](https://arxiv.org/pdf/2211.17192.pdf).\n",
    "\n",
    "![](https://github.com/user-attachments/assets/eb999dea-d98b-42bb-835e-28d3054e1a84)\n",
    "\n",
    "Possibility to achieve significant speedup with Speculative Decoding is highly depends on selection of a high-quality draft model that is both efficient and well-aligned with the target. FastDraft is a novel and efficient approach for pre-training and aligning a draft model to any LLM to be used with speculative decoding, by incorporating efficient pre-training followed by fine-tuning over synthetic datasets generated by the target model. FastDraft was presented in the [paper](https://arxiv.org/abs/2411.11055) at [ENLSP@NeurIPS24](https://neurips2024-enlsp.github.io/accepted_papers.html) by Intel Labs.\n",
    "\n",
    "FastDraft pre-trained draft models achieve impressive results in key metrics of acceptance rate, block efficiency and up to 3x memory bound speed up\n",
    "when evaluated on code completion and up to 2x in summarization, text completion and instruction tasks and unlock large language models inference on AI-PC and other edge-devices.\n",
    "\n",
    "In this tutorial we consider how to apply Speculative decoding using FastDraft and OpenVINO GenAI.\n",
    "\n",
    "<img referrerpolicy=\"no-referrer-when-downgrade\" src=\"https://static.scarf.sh/a.png?x-pxid=5b5a4db0-7875-4bfb-bdbd-01698b5b1a77&file=notebooks/speculative-sampling/speculative-sampling.ipynb\" />\n",
    "\n",
    "#### Table of contents:\n",
    "\n",
    "- [Prerequisites](#Prerequisites)\n",
    "- [Prepare models](#Prepare-models)\n",
    "    - [Select inference device](#Select-inference-device)\n",
    "- [Run target model without speculative decoding](#Run-target-model-without-speculative-decoding)\n",
    "- [Run Speculative decoding pipeline](#Run-Speculative-decoding-pipeline)\n",
    "\n",
    "\n",
    "### Installation Instructions\n",
    "\n",
    "This is a self-contained example that relies solely on its own code.\n",
    "\n",
    "We recommend  running the notebook in a virtual environment. You only need a Jupyter server to start.\n",
    "For details, please refer to [Installation Guide](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/README.md#-installation-guide).\n",
    "\n",
    "<img referrerpolicy=\"no-referrer-when-downgrade\" src=\"https://static.scarf.sh/a.png?x-pxid=5b5a4db0-7875-4bfb-bdbd-01698b5b1a77&file=notebooks/speculative-sampling/speculative-sampling.ipynb\" />\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "08aa16b1-d2f6-4a3a-abfb-5ec278133c80",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "\n",
    "First, we should install the [OpenVINO GenAI](https://github.com/openvinotoolkit/openvino.genai) for running model inference.\n",
    "\n",
    "![](https://media.githubusercontent.com/media/openvinotoolkit/openvino.genai/refs/heads/master/src/docs/openvino_genai.svg)\n",
    "\n",
    "[OpenVINO™ GenAI](https://github.com/openvinotoolkit/openvino.genai) is a library of the most popular Generative AI model pipelines, optimized execution methods, and samples that run on top of highly performant [OpenVINO Runtime](https://github.com/openvinotoolkit/openvino).\n",
    "\n",
    "This library is friendly to PC and laptop execution, and optimized for resource consumption. It requires no external dependencies to run generative models as it already includes all the core functionality (e.g. tokenization via openvino-tokenizers).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dfd782ed",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install --pre -U openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly huggingface_hub datasets ipywidgets"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "54be999f",
   "metadata": {},
   "source": [
    "## Prepare models\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "As example, we will use already converted LLMs from [OpenVINO collection](https://huggingface.co/collections/OpenVINO/llm-6687aaa2abca3bbcec71a9bd).\n",
    "You can find OpenVINO optimized FastDraft models can be found in this [collection](https://huggingface.co/collections/OpenVINO/speculative-decoding-draft-models-673f5d944d58b29ba6e94161). You can choose either Phi-3 or Phi-4 to be used for demonstrating FastDraft performance. For Phi-3 we will use [Phi-3-mini-4k-instruct-int4-ov](https://huggingface.co/OpenVINO/Phi-3-mini-4k-instruct-int4-ov) as target model and [Phi-3-mini-FastDraft-50M-int8-ov](https://huggingface.co/OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov) as draft. For Phi-4 we will use [Phi-4-mini-instruct-int4-ov](https://huggingface.co/OpenVINO/Phi-4-mini-instruct-int4-ov) as target model and [Phi-4-mini-FastDraft-120M-int8-ov](https://huggingface.co/OpenVINO/Phi-4-mini-FastDraft-120M-int8-ov) as draft.\n",
    "\n",
    "In case, if you want run own models, you should convert them using [Hugging Face Optimum](https://huggingface.co/docs/optimum/intel/openvino/export) library accelerated by OpenVINO integration. More details about model preparation can be found in [OpenVINO LLM inference guide](https://docs.openvino.ai/2025/openvino-workflow-generative/genai-model-preparation.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61ecfe17",
   "metadata": {},
   "source": [
    "### Select model\n",
    "[back to top ⬆️](#Table-of-contents:)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fe934261",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "b65944bfc43c4acebea6ec5ebd78f981",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Dropdown(description='Select Model:', options=('Phi-3', 'Phi-4'), value='Phi-3')"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import ipywidgets as widgets\n",
    "\n",
    "model = widgets.Dropdown(\n",
    "    options=[\"Phi-3\", \"Phi-4\"],\n",
    "    value=\"Phi-3\",  # default value\n",
    "    description=\"Select Model:\",\n",
    ")\n",
    "\n",
    "model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ccca9ef6",
   "metadata": {},
   "source": [
    "### Download target and draft models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "74bb9f96",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "import huggingface_hub as hf_hub\n",
    "\n",
    "if model.value == \"Phi-4\":\n",
    "    target_model_id = \"OpenVINO/Phi-4-mini-instruct-int4-ov\"\n",
    "    draft_model_id = \"OpenVINO/Phi-4-mini-FastDraft-120M-int8-ov\"\n",
    "elif model.value == \"Phi-3\":\n",
    "    target_model_id = \"OpenVINO/Phi-3-mini-4k-instruct-int4-ov\"\n",
    "    draft_model_id = \"OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov\"\n",
    "else:\n",
    "    print(f\"Model {model} is not supported in this demo.\")\n",
    "\n",
    "draft_model_path = Path(draft_model_id.split(\"/\")[-1])\n",
    "target_model_path = Path(target_model_id.split(\"/\")[-1])\n",
    "\n",
    "if not draft_model_path.exists():\n",
    "    hf_hub.snapshot_download(draft_model_id, local_dir=draft_model_path)\n",
    "if not target_model_path.exists():\n",
    "    hf_hub.snapshot_download(target_model_id, local_dir=target_model_path)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "367f84f8-33e8-4ad6-bd40-e6fd41d2d703",
   "metadata": {},
   "source": [
    "### Select inference device\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "\n",
    "Select the device from dropdown list for running inference using OpenVINO.\n",
    "> **Note**: For achieving maximal performance, we recommend to use GPU as target device if it is available."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "6ddd57de-9f41-403c-bccc-8d3118654a24",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "093ab5068fc542a0826fbd4f8a8d97b8",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Dropdown(description='Device:', options=('CPU', 'GPU'), value='CPU')"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import requests\n",
    "\n",
    "if not Path(\"notebook_utils.py\").exists():\n",
    "    r = requests.get(\n",
    "        url=\"https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py\",\n",
    "    )\n",
    "    open(\"notebook_utils.py\", \"w\").write(r.text)\n",
    "\n",
    "from notebook_utils import device_widget\n",
    "\n",
    "device = device_widget(default=\"CPU\", exclude=[\"NPU\", \"AUTO\"])\n",
    "display(device)\n",
    "\n",
    "# Read more about telemetry collection at https://github.com/openvinotoolkit/openvino_notebooks?tab=readme-ov-file#-telemetry\n",
    "from notebook_utils import collect_telemetry\n",
    "\n",
    "collect_telemetry(\"speculative-sampling.ipynb\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "53666c13",
   "metadata": {},
   "source": [
    "## Run target model without speculative decoding\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "OpenVINO GenAI provides easy-to-use API for running text generation. Firstly we will create pipeline with `LLMPipeline`. `LLMPipeline` is the main object used for decoding. You can construct it straight away from the folder with the converted model. It will automatically load the `main model`, `tokenizer`, `detokenizer` and default `generation configuration`. \n",
    "After that we will configure parameters for decoding. \n",
    "Then we just run `generate` method and get the output in text format. We do not need to encode input prompt according to model expected template or write post-processing code for logits decoder, it will be done easily with LLMPipeline. \n",
    "\n",
    "To obtain intermediate generation results without waiting until when generation is finished, we will write streamer function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "553148f5",
   "metadata": {},
   "outputs": [],
   "source": [
    "import openvino_genai as ov_genai\n",
    "import time\n",
    "\n",
    "print(\"Initializing Auto-Regressive pipeline...\")\n",
    "pipe = ov_genai.LLMPipeline(target_model_path, device.value)\n",
    "\n",
    "config = ov_genai.GenerationConfig()\n",
    "config.max_new_tokens = 200\n",
    "config.apply_chat_template = False\n",
    "\n",
    "prompt = '''<s>\n",
    "\n",
    "def prime_fib(n: int):\n",
    "    \"\"\"\n",
    "    prime_fib returns n-th number that is a Fibonacci number and it's also prime.\n",
    "    >>> prime_fib(1)\n",
    "    2\n",
    "    >>> prime_fib(2)\n",
    "    3\n",
    "    >>> prime_fib(3)\n",
    "    5\n",
    "    >>> prime_fib(4)\n",
    "    13\n",
    "    >>> prime_fib(5)\n",
    "    89\n",
    "    \"\"\"'''\n",
    "\n",
    "\n",
    "def streamer(subword):\n",
    "    print(subword, end=\"\", flush=True)\n",
    "    # Return flag corresponds whether generation should be stopped.\n",
    "    # False means continue generation.\n",
    "    return False\n",
    "\n",
    "\n",
    "start_time = time.perf_counter()\n",
    "pipe.generate(prompt, config, streamer=streamer)\n",
    "end_time = time.perf_counter()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c40d9901-ceb2-4c4c-a686-303590292ab3",
   "metadata": {},
   "outputs": [],
   "source": [
    "import gc\n",
    "\n",
    "print(f\"Generation time: {end_time - start_time:.2f}s\")\n",
    "del pipe\n",
    "gc.collect()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "27a01739-1363-42ef-927f-6a340bdbe7ba",
   "metadata": {},
   "source": [
    "## Run Speculative decoding pipeline\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "To enable Speculative decoding in `LLMPipeline,` we should additionally provide the `draft_model` structure and `SchedulerConfig` for resource management. \n",
    "\n",
    "![](https://github.com/user-attachments/assets/69f5c096-abca-4f97-952b-291c52eb3444)\n",
    "\n",
    "\n",
    "As shown in the figure above, speculative decoding works by splitting the generative process into two stages. In the first stage, a fast, but less accurate draft model (AKA assistant) autoregressively generates a sequence of tokens. In the second stage, a large, but more accurate target model conducts parallelized verification over the generated draft tokens. This process allows the target model to produce multiple tokens in a single forward pass and thus accelerate autoregressive decoding. The success of speculative decoding largely hinges on the speculation lookahead (SL), i.e. the number of tokens produced by the draft model in each iteration.  The straightforward method, based on [Leviathan et al.](https://arxiv.org/pdf/2211.17192), uses a static value of the speculation lookahead and involves generating a constant number of candidate tokens at each speculative iteration. You can adjust the number of candidates using `num_assistant_tokens` parameter in generation config. If the assistant model's confidence in its prediction for the current token is lower than this threshold, the assistant model stops the current token generation iteration is not yet reached."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9fde1b3c",
   "metadata": {},
   "outputs": [],
   "source": [
    "draft_model = ov_genai.draft_model(draft_model_path, device.value)\n",
    "config.num_assistant_tokens = 5\n",
    "\n",
    "print(\"Initializing Speculative-Decoding pipeline...\")\n",
    "pipe = ov_genai.LLMPipeline(target_model_path, device.value, draft_model=draft_model)\n",
    "\n",
    "start_time = time.perf_counter()\n",
    "result = pipe.generate(prompt, config, streamer=streamer)\n",
    "end_time = time.perf_counter()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d9739752-0bd8-4be7-a4cc-c076228bfc91",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"Generation time: {end_time - start_time:.2f}s\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "9061f205-862a-450e-a102-4d3ea162f588",
   "metadata": {},
   "source": [
    "Alternative approach, Dynamic Speculative Decoding, described in the [paper](https://arxiv.org/abs/2405.04304) is based on heuristics and adjusts the number of candidate tokens for the next iteration based on the acceptance rate of the current iteration. If all speculative tokens are correct, the number of candidate tokens increases; otherwise, it decreases. For adjusting number of tokens `assistant_confidence_threshold` parameters should be used. If the assistant model's confidence in its prediction for the current token is lower than this threshold, the assistant model stops the current token generation iteration, even if the number of `num_assistant_tokens` is not yet reached.  You can find more details in this [blog post](https://huggingface.co/blog/dynamic_speculation_lookahead). This approach has advantages for cases, when optimal number of tokens for draft model is unknown and draft model has low acceptance rate.\n",
    "\n",
    ">*Note*: For small and fast draft models like FastDraft, you may not see benefit for dynamic speculative decoding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f9c011ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "config.num_assistant_tokens = 0\n",
    "config.assistant_confidence_threshold = 0.05\n",
    "\n",
    "start_time = time.perf_counter()\n",
    "result = pipe.generate(prompt, config, streamer)\n",
    "end_time = time.perf_counter()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5803b7c-b38b-474d-9604-363e3813b6b3",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"Generation time: {end_time - start_time:.2f}s\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "fd59ed90",
   "metadata": {},
   "source": [
    "## Evaluate Speculative Decoding on multiple examples"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "88772975-bce3-49c8-bae7-28cd3d1d44e1",
   "metadata": {},
   "source": [
    "Configure the data type and the number of examples to run:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "64a36dc1-958c-4f7e-baba-efa89a2d9a8f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "ac15c0165c8e4a1c92ad8c1da9bc7879",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Dropdown(description='Data type:', options=('Code', 'Text'), value='Code')"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "num_samples_to_select = 50\n",
    "\n",
    "import ipywidgets as widgets\n",
    "\n",
    "data_options = [\"Code\", \"Text\"]\n",
    "data_type = widgets.Dropdown(\n",
    "    options=data_options,\n",
    "    value=data_options[0],\n",
    "    description=\"Data type:\",\n",
    "    disabled=False,\n",
    ")\n",
    "data_type"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "8f3486cd",
   "metadata": {},
   "source": [
    "Load the dataset and prepare the prompts:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "13f03634",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "loading dataset...\n",
      "Done\n"
     ]
    }
   ],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "print(\"loading dataset...\")\n",
    "\n",
    "if data_type.value == \"Code\":\n",
    "    ds = load_dataset(\"openai_humaneval\", split=\"test\")\n",
    "    prompts = ds[\"prompt\"]\n",
    "    prompts = [\"<s>\" + prompts[i] for i in range(num_samples_to_select)]\n",
    "else:\n",
    "    ds = load_dataset(\"abisee/cnn_dailymail\", \"3.0.0\", split=\"test\")\n",
    "    prompts = ds[\"article\"]\n",
    "    prompts = [\n",
    "        \"<|user|> ###\\nArticle: \" + prompts[i] + \"\\n\\nSummarize the above article in 5 sentence.\\n<|end|><|assistant|>\" for i in range(num_samples_to_select)\n",
    "    ]\n",
    "print(\"Done\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "be4e20d6",
   "metadata": {},
   "source": [
    "Run auto-regressive generation and get total runtime per example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1f4ea9e5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Initializing Auto-Regressive pipline...\n",
      "Running Warmup...\n",
      "Running Auto-Regressive generation...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████████████████████████████████████████████████████████████████████████████| 50/50 [04:29<00:00,  5.40s/it]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Done\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "27"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import openvino_genai as ov_genai\n",
    "import time\n",
    "from tqdm import tqdm\n",
    "\n",
    "print(\"Initializing Auto-Regressive pipline...\")\n",
    "pipe = ov_genai.LLMPipeline(target_model_path, device.value)\n",
    "\n",
    "config = ov_genai.GenerationConfig()\n",
    "config.apply_chat_template = False\n",
    "config.max_new_tokens = 330\n",
    "if data_type.value == \"Code\":\n",
    "    config.max_new_tokens = 128\n",
    "\n",
    "# warmup\n",
    "print(\"Running Warmup...\")\n",
    "for i in range(10):\n",
    "    pipe.generate(\"this is a warmup prompt\", config)\n",
    "\n",
    "print(\"Running Auto-Regressive generation...\")\n",
    "times_auto_regressive = []\n",
    "for prompt in tqdm(prompts):\n",
    "    start_time = time.perf_counter()\n",
    "    result = pipe.generate(prompt, config)\n",
    "    end_time = time.perf_counter()\n",
    "    times_auto_regressive.append(end_time - start_time)\n",
    "print(\"Done\")\n",
    "\n",
    "import gc\n",
    "\n",
    "del pipe\n",
    "gc.collect()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "35dbba92",
   "metadata": {},
   "source": [
    "Now run generation with speculative-decoding:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d73e9f37",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Initializing Speculative-Decoding pipline...\n",
      "Running Warmup...\n",
      "Running Speculative Decoding generation...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████████████████████████████████████████████████████████████████████████████| 50/50 [04:10<00:00,  5.01s/it]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Done\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "config.num_assistant_tokens = 5\n",
    "draft_model = ov_genai.draft_model(draft_model_path, device.value)\n",
    "\n",
    "print(\"Initializing Speculative-Decoding pipline...\")\n",
    "pipe = ov_genai.LLMPipeline(target_model_path, device.value, draft_model=draft_model)\n",
    "\n",
    "# warmup\n",
    "print(\"Running Warmup...\")\n",
    "for i in range(10):\n",
    "    pipe.generate(\"this is a warmup prompt\", config)\n",
    "\n",
    "times_speculative_decoding = []\n",
    "print(\"Running Speculative Decoding generation...\")\n",
    "for prompt in tqdm(prompts):\n",
    "    start_time = time.perf_counter()\n",
    "    result = pipe.generate(prompt, config)\n",
    "    end_time = time.perf_counter()\n",
    "    times_speculative_decoding.append((end_time - start_time))\n",
    "print(\"Done\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "a0f4da9c",
   "metadata": {},
   "source": [
    "Now let's calculate the speedup:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "ad898772",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "average speedup: 1.09\n"
     ]
    }
   ],
   "source": [
    "avg_speedup = sum([x / y for x, y in zip(times_auto_regressive, times_speculative_decoding)]) / len(prompts)\n",
    "print(f\"average speedup: {avg_speedup:.2f}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  },
  "openvino_notebooks": {
   "imageUrl": "https://github.com/user-attachments/assets/eb999dea-d98b-42bb-835e-28d3054e1a84",
   "tags": {
    "categories": [
     "Model Demos"
    ],
    "libraries": [],
    "other": [
     "LLM"
    ],
    "tasks": [
     "Text Generation"
    ]
   }
  },
  "vscode": {
   "interpreter": {
    "hash": "cec18e25feb9469b5ff1085a8097bdcd86db6a4ac301d6aeff87d0f3e7ce4ca5"
   }
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {
     "c09eb6c800744d31bd23e38d33a82b0a": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {}
     },
     "d4e65aeb9fd243c99022f6dede35f3c0": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "DescriptionStyleModel",
      "state": {
       "description_width": ""
      }
     },
     "e83ffbfc2136400194e2b1da63bccb26": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "DropdownModel",
      "state": {
       "_options_labels": [
        "CPU"
       ],
       "description": "Device:",
       "index": 0,
       "layout": "IPY_MODEL_c09eb6c800744d31bd23e38d33a82b0a",
       "style": "IPY_MODEL_d4e65aeb9fd243c99022f6dede35f3c0"
      }
     }
    },
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
